Abstract

Using the cavity method we consider the learning of noisy teacher- 
generated examples by a nonlinear student perceptron. For insuffi- 
cient examples and weak weight decay, the activation distribution 
of the training examples exhibits a gap for the more difficult exam- 
ples. This illustrates that the outliers are sacrificed for the overall 
performance. Simulation shows that the picture of the smooth en- 
ergy landscape cannot describe the gapped distributions well, im- 
plying that a rough energy landscape may complicate the learning 
process. 



1 Introduction 

The learning of noisy examples by a nonlinear perceptron is a frustrating process, in 
the sense that competing information extracted from the training examples needs 
to be processed Q. Since learning usually involves minimizing a cost function, 
the learner often has to choose between interpolating the conflicting examples or 
sacrificing some in favor of satisfying others, so as to attain a minimum overall cost. 
This kind of competition is especially marked in nonlinear perceptrons. 

This sacrificial effect is an important issue. As we shall see, it will lead to a gap in the 
activation distribution of the examples. The activations of the sacrificed examples 
are separated from those of the satisfied ones by a wide margin, and different choices 
of partitioning the sacrificed and preferred examples correspond to different local 
minima in the energy landscape. The appearance of a multiplicity of local minima 
leads to the roughening of the energy landscape, rendering the learning processes 
prone to being trapped before reaching the global minimum. 

The effect is best understood using the cavity approach ||, g|. It uses a self- 
consistency argument to consider what happens when a new example is added to 
the training set. When the learner adopts a sacrificial learning strategy, learning 
the added example may lead to different choices of the sacrificed examples in the 
background, causing a shift of the global energy minimum among a number of local 
minima. 



The cavity method yields identical macroscopic predictions with the replica method 
M . The assumption of a smooth energy landscape corresponds to the replica symmetric
ansatz in the replica approach. Instability appears in the ansatz when the
Almeida-Thouless condition is violated, beyond which replica symmetry breaking
solutions have to be introduced, corresponding to a rough energy landscape. As
we shall see, the appearance of the gap is closely related to the Almeida-Thouless
line.

In this paper, we analyze the learning of noisy examples by a nonlinear perceptron 
using the cavity method. We study the activation distribution of the training exam- 
ples and find parameter regimes with gaps in the distribution. Simulation results 
show that the assumption of a smooth energy landscape works well when no gaps 
are present, but fails when gaps appear. 



2 The cavity method 



2.1 Formulation 



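The rule to be learned is generated by a teacher perceptron with $N$ weights $B_j$,
$j = 1, \ldots, N$ and $\langle B_j^2 \rangle = 1$. The student perceptron, with $N$ weights $J_j$,
$j = 1, \ldots, N$, tries to model the teacher by learning from a set of $p$ examples.
Each example, labeled $\mu$ with $\mu = 1, \ldots, p$, consists of an input vector
$\boldsymbol{\xi}^\mu$ and the noisy output $O_\mu$ of the teacher. The input components $\xi_j^\mu$ are
random variables, with $\langle \xi_j^\mu \rangle = 0$ and $\langle \xi_j^\mu \xi_k^\nu \rangle = \delta_{jk}\delta_{\mu\nu}$.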

The teacher output is a nonlinear activation function of the teacher activation
$y_\mu = \mathbf{B} \cdot \boldsymbol{\xi}^\mu/\sqrt{N}$, corrupted by a Gaussian noise $\eta_\mu$ with
$\langle \eta_\mu \rangle = 0$ and $\langle \eta_\mu^2 \rangle = 1$. That is, $O_\mu = f(y_\mu + T\eta_\mu)$, where $T$ is the
noise level, and here we use the sigmoid function $f(x) = [1 + e^{-x}]^{-1}$.
Correspondingly, the student models the teacher outputs by $f_\mu = f(x_\mu)$, where the
student activation is $x_\mu = \mathbf{J} \cdot \boldsymbol{\xi}^\mu/\sqrt{N}$.

Learning is attained by minimizing the energy function which consists of the errors 
of the student in reproducing the teacher's outputs for the training set, as well as 
the penalty term for excessive complexity. Hence we use the energy function 

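$$E = \frac{1}{2}\sum_\mu (O_\mu - f_\mu)^2 + \frac{\lambda}{2}\sum_j J_j^2, \tag{1}$$

where $\lambda$ is the weight decay strength. Minimizing the energy function by gradient
descent, the student reaches the equilibrium state

$$J_j = \frac{1}{\lambda\sqrt{N}}\sum_\mu (O_\mu - f_\mu) f'_\mu \xi_j^\mu.$$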
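The learning setup of this subsection can be sketched numerically. The following is only a minimal illustration, not the procedure used for the simulations reported here: it assumes the energy $E = \frac{1}{2}\sum_\mu (O_\mu - f_\mu)^2 + \frac{\lambda}{2}\sum_j J_j^2$ with the sigmoid $f$, and the system size, learning rate, number of steps and the function names `make_examples`, `energy_and_grad` and `train` are hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_examples(N, p, T, rng):
    """Draw a teacher and p noisy examples (assumed Gaussian weights and inputs)."""
    B = rng.standard_normal(N)                    # teacher weights, <B_j^2> = 1
    xi = rng.standard_normal((p, N))              # inputs: zero mean, unit variance
    y = xi @ B / np.sqrt(N)                       # teacher activations
    O = sigmoid(y + T * rng.standard_normal(p))   # noisy teacher outputs
    return B, xi, O

def energy_and_grad(J, xi, O, lam):
    """Energy (squared errors plus weight decay) and its gradient w.r.t. J."""
    N = J.size
    f = sigmoid(xi @ J / np.sqrt(N))
    err = O - f
    E = 0.5 * np.sum(err ** 2) + 0.5 * lam * np.sum(J ** 2)
    # f'(x) = f (1 - f) for the sigmoid
    grad = -(xi.T @ (err * f * (1.0 - f))) / np.sqrt(N) + lam * J
    return E, grad

def train(xi, O, lam, lr=0.5, steps=2000):
    """Plain gradient descent from J = 0; returns weights and energy history."""
    J = np.zeros(xi.shape[1])
    energies = []
    for _ in range(steps):
        E, g = energy_and_grad(J, xi, O, lam)
        energies.append(E)
        J -= lr * g
    return J, np.array(energies)

rng = np.random.default_rng(0)
B, xi, O = make_examples(N=50, p=150, T=0.1, rng=rng)   # alpha = p/N = 3
J, energies = train(xi, O, lam=0.002)
print("initial energy:", energies[0], "final energy:", energies[-1])
```

With weak weight decay the descent along flat directions is slow, so a fixed number of steps need not reach the equilibrium state exactly; the gradient norm can be monitored to judge convergence.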



2.2 Adding an example 

If an example 0 is fed to the student, the activation

$$t_0 = \mathbf{J} \cdot \boldsymbol{\xi}^0/\sqrt{N}$$

is called the cavity activation. Since the student $\mathbf{J}$ has no information about the
example, the cavity activation is a Gaussian variable for random inputs $\boldsymbol{\xi}^0$. Its
mean, variance and covariance with the teacher activation $y_0$ of example 0 are given by
$\langle\langle t_0 \rangle\rangle = 0$, $\langle\langle t_0^2 \rangle\rangle = q$ and $\langle\langle t_0 y_0 \rangle\rangle = R$ respectively, where
$\langle\langle \cdot \rangle\rangle$ denotes the ensemble average, and the parameters $q$ and $R$ are defined by

$$q = \langle J_j^2 \rangle \quad\text{and}\quad R = \langle J_j B_j \rangle. \tag{2}$$

Hence in the large $N$ limit, the cavity activation can be expressed as
$t_0 = R y_0 + \sqrt{q - R^2}\,\zeta_0$, where $\zeta_0$ is a Gaussian variable with mean 0 and variance 1.

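Now compare the student $\mathbf{J}$ with another one which incorporates example 0 in the
training set, denoted by $\mathbf{J}^0$. The generic student activation
$x_0 = \mathbf{J}^0 \cdot \boldsymbol{\xi}^0/\sqrt{N}$ is no longer a Gaussian variable. Nevertheless, it is
reasonable to expect that the difference between the students $\mathbf{J}$ and $\mathbf{J}^0$ is small.
Following the perturbative analysis in ||, we conclude that $x_0$ is a well defined
function of $t_0$, given by

$$t_0 = x_0 - \gamma (O_0 - f_0) f'_0, \tag{3}$$

where $\gamma$ is the local susceptibility given by

$$1 - \lambda\gamma = \alpha \left( 1 - \left\langle\left\langle \frac{\partial x_0}{\partial t_0} \right\rangle\right\rangle \right), \tag{4}$$

and $\alpha \equiv p/N$ is the number of examples per weight.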

For the nonlinear perceptron, it is possible that for sufficiently large $\gamma$, the generic
activation $x_0$ is a multi-valued function of the cavity activation $t_0$. In this case
we have to choose the one which minimizes the energy function in (1). The cavity
method shows that the energy increase on adding example 0 is

$$\Delta E = \frac{1}{2}(O_0 - f_0)^2 + \frac{1}{2\gamma}(x_0 - t_0)^2. \tag{5}$$

The first term is the primary change due to the added example, and the second term
is due to the adjustment of the background examples. In the multi-valued region
one needs to compare those solutions whose values of $x_0$ are closer to $t_0$ (therefore
favorable in the background adjustment) with those whose outputs $f_0$ are closer to
the teacher's outputs $O_0$ (therefore favorable in the primary cost). This competition
leads to a discontinuity in the range of the activation $x_0$ when the cavity activation
$t_0$ varies, accompanied by the appearance of gaps in the activation distribution.
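The branch selection described above can be illustrated with a short numerical sketch. It assumes the energy increase $\Delta E = \frac{1}{2}(O_0 - f(x_0))^2 + (x_0 - t_0)^2/(2\gamma)$, whose stationary points satisfy $t_0 = x_0 - \gamma (O_0 - f_0) f'_0$, and simply minimizes $\Delta E$ over a grid of $x_0$ for each $t_0$. The grid ranges and the function name `scan_generic_activation` are hypothetical; the values $\gamma = 18.6$ and $O_0 = f(-5)$ are those quoted for Fig. 2(b).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scan_generic_activation(O, gamma, t_values, x_grid):
    """For each cavity activation t, return the x on the grid minimizing Delta E."""
    primary = 0.5 * (O - sigmoid(x_grid)) ** 2       # cost of the added example
    x_of_t = np.empty_like(t_values)
    for i, t in enumerate(t_values):
        background = (x_grid - t) ** 2 / (2.0 * gamma)  # background adjustment
        x_of_t[i] = x_grid[np.argmin(primary + background)]
    return x_of_t

O = sigmoid(-5.0)        # teacher output for y + T*eta = -5
gamma = 18.6             # local susceptibility in the gapped regime
t_values = np.linspace(0.0, 8.0, 8001)
x_grid = np.linspace(-15.0, 15.0, 30001)

x_of_t = scan_generic_activation(O, gamma, t_values, x_grid)

# The largest jump of x between neighbouring t values marks the gap.
k = int(np.argmax(np.diff(x_of_t)))
gap_lo, gap_hi = x_of_t[k], x_of_t[k + 1]
t_star = 0.5 * (t_values[k] + t_values[k + 1])
print(f"jump at t ~ {t_star:.3f}: no activations between {gap_lo:.2f} and {gap_hi:.2f}")
```

Under these assumptions the two edges of the jump should land close to the theoretical range [1.19, 2.68] quoted in the caption of Fig. 2. Below the jump the example is satisfied (the student output stays nearer the teacher's); above it the example is sacrificed (the activation stays nearer the cavity activation).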



2.3 Adding an input 

Similarly, the cavity method can be used to analyze the changes when an input 0
is added to the rule. In this case, the teacher outputs are given by
$O_\mu^0 = f(y_\mu + B_0 \xi_0^\mu/\sqrt{N} + T\eta_\mu)$, where $y_\mu$ is the original teacher activation
with $N$ inputs. If the student has inputs 1 to $N$ only, the resultant student
perceptron is given by

$$J_j = \sum_\mu (O_\mu^0 - f_\mu) f'_\mu \xi_j^\mu/(\lambda\sqrt{N}).$$

We may construct the weight for input 0 of the student perceptron using the same
prescription, namely

$$Z_0 = \sum_\mu (O_\mu^0 - f_\mu) f'_\mu \xi_0^\mu/(\lambda\sqrt{N}).$$

However, this is not the generic weight since in the activation functions $f_\mu$, the
arguments $x_\mu$ do not contain the input 0; nor is the information fed from input 0
ever utilized in the learning of $x_\mu$. Hence $Z_0$ is called the cavity weight.

Now compare the student with another one with inputs 0 to $N$, and which
incorporates input 0 in all the training examples. Its weights are denoted by $J_j^0$ for
$j = 1, \ldots, N$ and $J_0$ for input 0. $J_0$ is different from $Z_0$. Nevertheless, it is also
reasonable to expect that the difference between the students $J_j$ and $J_j^0$ is small.
Using the perturbative analysis, we conclude that $J_0$ is a well defined function of
$Z_0$, given by

$$J_0 = \gamma\lambda Z_0. \tag{6}$$



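The cavity weight distribution $P(Z_0|B_0)$ is a Gaussian with mean and variance
given by

$$\langle\langle Z_0 \rangle\rangle = \frac{\alpha}{\lambda} \left\langle\left\langle \frac{O'_\mu f'_\mu}{1 + \gamma [f'^2_\mu - (O_\mu - f_\mu) f''_\mu]} \right\rangle\right\rangle B_0, \tag{7}$$

$$\langle\langle Z_0^2 \rangle\rangle - \langle\langle Z_0 \rangle\rangle^2 = \frac{\alpha}{\lambda^2} \langle\langle (O_\mu - f_\mu)^2 f'^2_\mu \rangle\rangle, \tag{8}$$

where $O'_\mu \equiv f'(y_\mu + T\eta_\mu)$.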



2.4 Macroscopic parameters 



Making use of the relation (6), we can obtain the self-consistent equations for the
macroscopic parameters $\gamma$, $R$ and $q$ in (4) and (2),

$$1 - \gamma\lambda = \alpha\gamma \iint \frac{f'(x)^2 - [f(\sqrt{1+T^2}\,u) - f(x)] f''(x)}{1 + \gamma \{ f'(x)^2 - [f(\sqrt{1+T^2}\,u) - f(x)] f''(x) \}}\, Du\, Dv, \tag{9}$$

$$R = \alpha\gamma \iint \frac{f'(\sqrt{1+T^2}\,u)\, f'(x)}{1 + \gamma \{ f'(x)^2 - [f(\sqrt{1+T^2}\,u) - f(x)] f''(x) \}}\, Du\, Dv, \tag{10}$$

$$q - R^2 = \alpha\gamma^2 \iint [f(\sqrt{1+T^2}\,u) - f(x)]^2 f'(x)^2\, Du\, Dv, \tag{11}$$

where $Du = du\, \exp[-u^2/2]/\sqrt{2\pi}$ and $Dv = dv\, \exp[-v^2/2]/\sqrt{2\pi}$ are two independent
Gaussian measures. Both the noise-corrupted teacher activation and the cavity
activation are Gaussian distributed and determined from $u$ and $v$ via

$$y + T\eta = \sqrt{1+T^2}\,u \quad\text{and}\quad t = \frac{R}{\sqrt{1+T^2}}\,u + \sqrt{q - \frac{R^2}{1+T^2}}\,v, \tag{12}$$

and the generic activation $x$ is given by the solution of

$$t = x - \gamma [f(\sqrt{1+T^2}\,u) - f(x)] f'(x). \tag{13}$$
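As a consistency check, the integrand of the self-consistent equation for $\gamma$ can be related to the map between the cavity and generic activations. The following worked step is a sketch that assumes the branch $x(t)$ is single-valued and differentiable, and starts from $t = x - \gamma [f(\sqrt{1+T^2}\,u) - f(x)] f'(x)$; the shorthand $h(x,u)$ is introduced here for illustration only.

```latex
\begin{align*}
h(x,u) &\equiv f'(x)^2 - \bigl[f(\sqrt{1+T^2}\,u) - f(x)\bigr] f''(x), \\
\frac{\partial t}{\partial x} &= 1 + \gamma\, h(x,u), \\
1 - \frac{\partial x}{\partial t} &= 1 - \frac{1}{1 + \gamma h(x,u)}
  = \frac{\gamma\, h(x,u)}{1 + \gamma\, h(x,u)} .
\end{align*}
```

Averaging the last line over $u$ and $v$ and multiplying by $\alpha$ turns the susceptibility relation $1 - \lambda\gamma = \alpha(1 - \langle\langle \partial x/\partial t \rangle\rangle)$ into the integral form with prefactor $\alpha\gamma$. The map becomes multi-valued once $1 + \gamma h(x,u)$ can vanish, which requires $h < 0$ somewhere.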



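The progress of learning is monitored by the training error $\epsilon_t$ and the generalization
error $\epsilon_g$, which are respectively the root mean square errors for a training example
and an arbitrary example,

$$\epsilon_t^2 = \iint [f(\sqrt{1+T^2}\,u) - f(x)]^2\, Du\, Dv, \tag{14}$$

$$\epsilon_g^2 = \iint \left[ f(\sqrt{1+T^2}\,u) - f\!\left( \frac{R}{\sqrt{1+T^2}}\,u + \sqrt{q - \frac{R^2}{1+T^2}}\,v \right) \right]^2 Du\, Dv. \tag{15}$$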
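For given macroscopic parameters, the double Gaussian integral for the generalization error can be evaluated by Gauss-Hermite quadrature. The sketch below assumes the root mean square form $\epsilon_g^2 = \iint [f(\sqrt{1+T^2}u) - f(Ru/\sqrt{1+T^2} + \sqrt{q - R^2/(1+T^2)}\,v)]^2 Du\,Dv$ with the sigmoid $f$; the function name `generalization_error`, the number of quadrature nodes and the parameter values in the usage lines are hypothetical, and the values of $R$ and $q$ would in practice come from solving the self-consistent equations.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generalization_error(R, q, T, n=80):
    """Root mean square error on a new example, by 2D Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n)               # weight function exp(-x^2/2)
    weights = weights / np.sqrt(2.0 * np.pi)     # normalize to the measure Du
    u, v = np.meshgrid(nodes, nodes, indexing="ij")
    w = np.outer(weights, weights)
    s = np.sqrt(1.0 + T ** 2)
    teacher = sigmoid(s * u)                     # noisy teacher output
    spread = np.sqrt(max(q - R ** 2 / s ** 2, 0.0))
    student = sigmoid(R * u / s + spread * v)    # output at the cavity activation
    return np.sqrt(np.sum(w * (teacher - student) ** 2))

# A student identical to a noiseless teacher makes no error.
print(generalization_error(R=1.0, q=1.0, T=0.0))
# A student with zero weights always outputs 1/2.
print(generalization_error(R=0.0, q=0.0, T=0.0))
```

The new example is not in the training set, so the student responds with its cavity activation $t$ rather than the generic activation $x$; this is the only difference from the training error integral.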



2.5 Stability condition 

The validity of the perturbation approach can be checked by considering the stability
condition of the equilibrium state. When example 0 is added, the amplitude of the
change in the student vector is given by

$$\sum_j (J_j^0 - J_j)^2 = \frac{(x_0 - t_0)^2}{1 - \alpha \left\langle\left\langle \left( \frac{\partial x}{\partial t} - 1 \right)^2 \right\rangle\right\rangle}. \tag{16}$$

Hence the stability condition is

$$\alpha \left\langle\left\langle \left( \frac{\partial x}{\partial t} - 1 \right)^2 \right\rangle\right\rangle < 1. \tag{17}$$

This is identical to the stability condition of the replica-symmetric ansatz in the
replica approach, the so-called Almeida-Thouless (AT) condition. In particular, we
note that when a gap is present in the activation distribution, $x$ is a discontinuous
function of $t$ and the system becomes unstable.

3 The activation distribution 

Gaps in the activation distribution appear for values of local susceptibility $\gamma > 16.6$,
when a single value of the cavity activation in (13) may map onto multiple values of
the generic activation, corresponding to different energy minima. When the energy
minimum favors the generic activation to take a value closer to the teacher activation
than the cavity activation, the example is satisfied. Otherwise, when the generic
activation is closer to the cavity activation, the example is sacrificed.
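The threshold on the local susceptibility can be checked numerically. The sketch below assumes the sigmoid $f(x) = [1 + e^{-x}]^{-1}$ and the relation $t = x - \gamma (O - f(x)) f'(x)$, so that $\partial t/\partial x = 1 + \gamma h$ with $h = f'(x)^2 - (O - f(x)) f''(x)$; the map first becomes multi-valued when $1 + \gamma h$ can reach zero. The grids and the names `h` and `gamma_c` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h(x, O):
    """h = f'(x)^2 - (O - f(x)) f''(x) for the sigmoid f."""
    f = sigmoid(x)
    f1 = f * (1.0 - f)            # f'
    f2 = f1 * (1.0 - 2.0 * f)     # f''
    return f1 ** 2 - (O - f) * f2

x = np.linspace(-15.0, 15.0, 300001)
O_values = np.linspace(0.0, 1.0, 101)   # teacher outputs lie between 0 and 1

# Most negative value of h over activations and teacher outputs.
h_min = min(h(x, O).min() for O in O_values)

# dt/dx = 1 + gamma * h first vanishes at gamma_c = -1 / h_min.
gamma_c = -1.0 / h_min
print(f"critical susceptibility: {gamma_c:.2f}")
```

Since $h$ is linear in $O$, its minimum is reached at the extreme teacher outputs $O \to 0$ or $O \to 1$, and the resulting threshold agrees with the value 16.6 quoted above.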

As shown in Fig. 1(a), sacrificial learning first occurs at the extreme values of the 
teacher output O. This is because in nonlinear perceptrons, changes in the student 
activation around these extreme values of O do not result in significant changes 
in the training error of an example, and if the cavity activation is very different 
from the teacher's, it is more economical to keep the student activation close to 
the cavity activation, so that the background adjustment remains small. Hence 
sacrificial learning is a unique consequence of the nonlinearity of the perceptron 
output. In contrast, no sacrificial learning is present in linear perceptrons, even 
when perfect learning is impossible B . 




Figure 1: (a) The occurrence of sacrificial learning when $\gamma = 18.6$. No values of
the student output exist in the shaded region. Here $\alpha = 3$, $T = 2$ and $\lambda = 0.002$.
Dotted line: $y + T\eta = -5$, corresponding to the distribution in Fig. 2(b). (b)
Regions of the existence of gapped activation distribution for different noise levels.
The point $\alpha = 3$, $\lambda = 0.002$ is denoted by a star.

In Fig. 1(a), no values of the student output exist in the shaded region. For
$O < 0.078$, student activations to the left of the shaded region correspond to the
satisfied examples, whereas those to the right correspond to the sacrificed ones. For
intermediate values of $O$, the competition is weaker, and no gaps develop.

Figure 1(b) shows the regions for the existence of gapped activation distributions.
They are closely related to the unstable regions which violate the condition (17).
The gapped regions lie inside the unstable regions, since the development of a
gap is already sufficient to cause an uncontrollable change when a new example is
added. However, provided that $\alpha$ is not too small, the boundaries of the gapped
and unstable regions are very close to each other.

Figure 1(b) shows that frustration is serious when the training set size is small, lead- 
ing to gapped activation distributions. When the training examples are sufficient, 
the underlying rule can be extracted with confidence, thereby restoring the con- 
tinuous distribution. Furthermore, increasing the data noise broadens the gapped 
region. Indeed, noisy data introduces competing information to be learned by the 
student, and hence increases the degree of frustration. On the other hand, the 
gapped region narrows with increasing weight decay strength. Arguably, weight 
decay restricts the flexibility in the weight space, thus reducing the tendency for 
multiple minima. 




Figure 2: Student activation distribution at $\alpha = 3$ and $\lambda = 0.002$ (denoted by a
star in Fig. 1(b)). (a) $T = 0.1$ and $y + T\eta = -1$; (b) $T = 5$ and $y + T\eta = -5$.
Note that in (b) the gap from the simulation is broader than the theoretical range
[1.19, 2.68] (arrows, see Fig. 1(a)).

Figure 2(a) shows a typical activation distribution in the region of continuous
distribution, where $\gamma = 11.1$ and the stability condition (17) is fulfilled. Comparing with
the simulation result, we see that the assumption of a smooth energy landscape used
in the present work is valid in this region. The theoretical and simulational results
of $\epsilon_t$ and $\epsilon_g$ also agree. In contrast, the gapped distributions in Fig. 2(b) show that
the assumption does not describe the simulation result well when $\gamma = 18.6$ and the
stability condition (17) is violated. There are prominent differences in $\epsilon_t$ and $\epsilon_g$
between theoretical and simulational results. To improve the agreement, a rough
energy landscape as discussed in [3] must be introduced.

4 Conclusion 

We have demonstrated the existence of band gaps in the activation distribution, 
and attributed them to frustrations arising from the competition of conflicting in- 
formation inherent in noisy data, and the nonlinearity of the student perceptron. 
Activations corresponding to sacrificed or satisfied examples during learning are
separated by band gaps. The existence of band gaps necessitates the picture of a
rough energy landscape. In the picture of the replica approach, it corresponds to
the replica symmetry breaking ansatz beyond the Almeida-Thouless line.

We remark that the sacrificial effects are common in many other cases, such as
multilayer perceptrons jq] and weight pruning. They may also exist in Support
Vector Machines (SVM) when examples are noisy and insufficient. They may
create local minima which complicate the convergence of learning processes. Hence
it is an issue that should be considered both theoretically and practically.

Acknowledgments 

This work is supported by the Research Grants Council of Hong Kong
(HKUST6157/99P).

References 

[1] M. Mézard, G. Parisi and M. Virasoro, Spin Glass Theory and Beyond, World Scientific,
1987.

[2] K.Y.M. Wong, Europhys. Lett. 30, 245 (1995).

[3] K.Y.M. Wong, Advances in Neural Information Processing Systems 9, 302, M.C.
Mozer, M.I. Jordan and T. Petsche, eds., MIT Press, 1997.

[4] S. Bös, W. Kinzel and M. Opper, Phys. Rev. E 47, 1384 (1993).

[5] K.Y.M. Wong, Theoretical Aspects of Neural Computation, K.Y.M. Wong, I. King
and D.-Y. Yeung, eds., Springer, Singapore, 1998.

[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.



